Hierarchical Web Document Classification Based on Hierarchically Trained Domain Specific Words

نویسنده

Jing-Shin Chang

چکیده

Search engines return thousands of pages per query. Many of them are relevant to the“query words”but not interesting to the “users”due to different domain-specific meanings of the query terms. Re-classification of the returned documents based on domain specific meanings of the query terms would therefore be most effective. A cross domain entropy (CDE) measure is proposed to extract characteristic domain specific words (DSW’s) for each node of existing hierarchical web document trees. Domain specific class models are built based on the respective DSW’s. Such class models are then used for directly classifying new documents into the hierarchy, instead of using hierarchical clustering techniques. High accuracy can be achieved with very few domain specific words. With only the top 5~10% DSW’s and a maximum entropy based classifier, 99% accuracy is observed when classifying documents of a news web site into 63 domains. The precision and recall of the extracted domain specific words are also higher than those extracted with conventional TF-IDF term weighting method.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Domain Specific Word Extraction from Hierarchical Web Documents: A First Step Toward Building Lexicon Trees from Web Corpora

Domain specific words and ontological information among words are important resources for general natural language applications. This paper proposes a statistical model for finding domain specific words (DSW s) in particular domains, and thus building the association among them. When applying this model to the hierarchical structure of the web directories node-by-node, the document tree can pot...

متن کامل

A New Document Embedding Method for News Classification

Abstract- Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. There is lots of news spreading on the web. A text classifier can categorize news automatically and this facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...

متن کامل

Mining Domain Specific Words from Web Documents

Web pages provide not only plain text materials for training language models but also tag information for semantics annotation. The tags could be found either explicitly in the HTML documents or implicitly through the directory hierarchy of the documents, since the directory hierarchy can be regarded as a kind of classification tree for web documents, which assigns an implicit hidden tag to eac...

متن کامل

Hierarchical Organization and Description of Music Collections at the Artist Level

As digital music collections grow, so does the need to organizing them automatically. In this paper we present an approach to hierarchically organize music collections at the artist level. Artists are grouped according to similarity which is computed using a web search engine and standard text retrieval techniques. The groups are described by words found on the webpages using term selection tec...

متن کامل

ارائه روشی برای استخراج کلمات کلیدی و وزن‌دهی کلمات برای بهبود طبقه‌بندی متون فارسی

Due to ever-increasing information expansion and existing huge amount of unstructured documents, usage of keywords plays a very important role in information retrieval. Because of a manually-extraction of keywords faces various challenges, their automated extraction seems inevitable. In this research, it has been tried to use a thesaurus, (a structured word-net) to automatically extract them. A...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره شماره

صفحات -

تاریخ انتشار 2009

Hierarchical Web Document Classification Based on Hierarchically Trained Domain Specific Words

نویسنده

چکیده

منابع مشابه

Domain Specific Word Extraction from Hierarchical Web Documents: A First Step Toward Building Lexicon Trees from Web Corpora

A New Document Embedding Method for News Classification

Mining Domain Specific Words from Web Documents

Hierarchical Organization and Description of Music Collections at the Artist Level

ارائه روشی برای استخراج کلمات کلیدی و وزن‌دهی کلمات برای بهبود طبقه‌بندی متون فارسی

عنوان ژورنال:

اشتراک گذاری